length generalization
894403f9604374a7a003063e480f65b9-Paper-Conference.pdf
Transformers have theoretical limitations in modeling certain sequence-to-sequence tasks, yet it remains largely unclear if these limitations play a role in large-scale pretrained LLMs, or whether LLMs might effectively overcome these constraints in practice due to the scale of both the models themselves and their pretraining data. We explore how these architectural constraints manifest after pretraining, by studying a family of retrieval and copying tasks inspired by Liu et al. [2024a]. We use a recently proposed framework for studying length generalization [Huang et al., 2025] to provide guarantees for each of our settings.
Hardware-aligned Hierarchical Sparse Attention for Efficient Long-term Memory Access
A key advantage of Recurrent Neural Networks (RNNs) over Transformers is their linear computational and space complexity enables faster training and inference for long sequences. However, RNNs are fundamentally unable to randomly access historical context, and simply integrating attention mechanisms may undermine their efficiency advantages. To overcome this limitation, we propose Hierarchical Sparse Attention (HSA), a novel attention mechanism that enhances RNNs with long-range random access flexibility while preserving their merits in efficiency and length generalization. HSA divides inputs into chunks, selects the top-k chunks and hierarchically aggregates information. The core innovation lies in learning token-to-chunk relevance based on fine-grained token-level information inside each chunk. This approach enhances the precision of chunk selection across both in-domain and out-of-domain context lengths. To make HSA efficient, we further introduce a hardware-aligned kernel design. By combining HSA with Mamba, we introduce RAMba, which achieves perfect accuracy in passkey retrieval across 64 million contexts despite pre-training on only 4K-length contexts, and significant improvements on various downstream tasks, with nearly constant memory footprint. These results show RAMba's huge potential in long-context modeling.
Length Generalization via Auxiliary Tasks
Length generalization, the ability of sequence models to generalize to sequences longer than those encountered during training, remains a key challenge for transformers, especially in tasks requiring algorithmic reasoning. Existing theoretical understanding of length generalization is limited, often providing only asymptotic results or focusing on specific problem classes or architectural variants, while empirical approaches frequently rely on ad hoc and often fragile techniques. In this work we introduce a novel framework for analyzing and proving length generalization bounds under specified, verifiable assumptions. A key outcome of the theory is the identification of a natural set of auxiliary tasks, intricately related to the primary task structure, such that strong performance on these auxiliary tasks, alongside the primary task, provably guarantees length generalization within the framework. This motivates a multi-task training procedure that explicitly optimizes performance on both the primary and the identified auxiliary tasks. Empirical evaluations on a variety of synthetic benchmarks known to be challenging for length generalization, including sequence sorting, and reversal, demonstrate that our proposed method yields significant improvements in generalization to substantially longer sequences.
Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs
Length generalization--the ability to solve problems longer than those seen during training--remains a critical challenge for large language models (LLMs). Previous work modifies positional encodings (PEs) and data formats to improve length generalization on specific symbolic tasks such as addition and sorting. However, these approaches are fundamentally limited to special tasks, often degrading general language performance. Furthermore, they are typically evaluated on small transformers trained from scratch on single tasks and can cause performance drop when applied during post-training stage of practical LLMs with general capabilities. Hu et al. [19] proposed Rule-Following Fine-Tuning (RFFT) to improve length generalization in the post-training stage of LLMs.
Beyond Single-Task: Robust Multi-Task Length Generalization for LLMs
Length generalization--the ability to solve problems longer than those seen during training--remains a critical challenge for large language models (LLMs). Previous work modifies positional encodings (PEs) and data formats to improve length generalization on specific symbolic tasks such as addition and sorting. However, these approaches are fundamentally limited to special tasks, often degrading general language performance. Furthermore, they are typically evaluated on small transformers trained from scratch on single tasks and can cause performance drop when applied during post-training stage of practical LLMs with general capabilities. Hu et al., (2024) proposed Rule-Following Fine-Tuning (RFFT) to improve length generalization in the post-training stage of LLMs. Despite its compatibility with practical models and strong performance, RFFT is proposed for single tasks too, requiring re-training for each individual task with extensive examples. In this paper, we study length generalization in multi-task settings and propose, the first framework enabling robust length generalization.
PositionCoupling: ImprovingLengthGeneralization ofArithmeticTransformersUsingTaskStructure
Humans can length-generalize in integer addition because they understand the essential principle of the task. Nevertheless, itisobserved that Transformers typically learn to solve addition only up to the training sequence length (Lee et al., 2024), which is different from thetruearithmetic algorithm thathumans "implement".